Abstract

Consider a family of sets and a single set, called the query set. How 
can one quickly find a member of the family which has a maximal in- 
tersection with the query set? Time constraints on the query and on a 
possible preprocessing of the set family make this problem challenging. 
Such maximal intersection queries arise in a wide range of applications, 
including web search, recommendation systems, and distributing on-line 
advertisements. In general, maximal intersection queries are computation- 
ally expensive. We investigate two well-motivated distributions over all 
families of sets and propose an algorithm for each of them. We show that 
with very high probability an almost optimal solution is found in time 
which is logarithmic in the size of the family. Moreover, we point out 
a threshold phenomenon on the probabilities of intersecting sets in each 
of our two input models which leads to the efficient algorithms mentioned 
above. 

1 Introduction 

The nearest neighbor problem is the task of determining, in a general metric
space, a point that is closest to a given query point. Queries of this kind appear
in a huge number of applied problems: text classification, handwriting recognition,
recommendation systems, distributing on-line advertisements, near-duplicate
detection, and code plagiarism detection.

In this paper we consider the nearest neighbor problem in a "binary" form.
Namely, every object is described as a set of its features, and similarity is
defined as the number of common features. In order to construct an efficient
solution, some assumptions have to be added to the problem. Here we assume
that the input behaves according to some predefined distribution. Then we
construct an algorithm and show that the time complexity and/or the accuracy
are reasonably good with high probability. Here the probability is taken over the
input distribution, not over random choices of the algorithm. This probabilistic
approach was inspired by the recent survey of Newman [18], which gives a
comprehensive account of random models of graphs that agree well with many
real-life networks, including Web graphs, friendship graphs, co-authorship graphs,
and many others. Hence, we can attack the nearest neighbor problem in already
"verified" random models.

The Maximal Intersection Problem. Consider a family of sets and a single 
set. We ask for a member of the set family which has a maximal intersection 
with the query set. 



The Maximal Intersection Problem (MaxInt)

Database: A family $\mathcal{F}$ of $n$ sets such that $|f| \le k$ for all $f \in \mathcal{F}$.

Query: Given a set $f_{new}$ with $|f_{new}| \le k$, return $f_i \in \mathcal{F}$ with maximal
$|f_{new} \cap f_i|$.

Constraints: Preprocessing time $n \cdot (\log n)^{O(1)} \cdot k^{O(1)}$.
Query time $(\log n)^{O(1)} \cdot k^{O(1)}$.

Let us restate the problem in a graph theoretical notation which will allow a
more convenient description of some applications of MaxInt later. A database
is a bipartite graph with vertex set partition $(V, V')$ such that $|V| = n$ and the
degree of every $u \in V$ is at most $k$. A query is a (new) vertex $v$ (together with
edges connecting $v$ with $V'$) of degree at most $k$. The query task is to return a
vertex $u \in V$ with a maximal number of paths of length 2 from $v$ to $u$.
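As a point of reference, the problem statement can be sketched as a plain linear scan. This is only a baseline under our own naming (`maxint_bruteforce` is a hypothetical helper, and elements are assumed to be integers); its query time is linear in $n$, so it does not meet the constraints above.

```python
from typing import List, Set, Tuple


def maxint_bruteforce(family: List[Set[int]], query: Set[int]) -> Tuple[int, int]:
    """Return (index, intersection size) of a family member maximizing |query & f_i|.

    Scans every set once; ties are resolved in favor of the smallest index.
    Returns (-1, -1) for an empty family.
    """
    best_index, best_size = -1, -1
    for i, f in enumerate(family):
        size = len(query & f)
        if size > best_size:
            best_index, best_size = i, size
    return best_index, best_size


if __name__ == "__main__":
    family = [{1, 2, 3}, {2, 4, 6}, {1, 4, 5}]
    print(maxint_bruteforce(family, {2, 4, 5}))
```

Such a scan is useful mainly for checking the answers of faster, approximate methods on small inputs.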

Our main motivation for studying MaxInt was problems like text cluster- 
ing, near-duplicate detection or distribution of on-line advertisements. In these 
problems, the database mainly consists of natural language text documents. 
Therefore, we will deal in the rest of the paper with documents and terms in- 
stead of sets and elements. On the other hand, we want to stress that our ideas 
and algorithms can be applied to every input following our models. Note that 
in this work documents are not considered to be multisets of terms. But, as we
will see in Section 2, we use the fact that every term in a document occurs with
a certain multiplicity.

Results. In Section 2 and Section 3 we propose two new randomized input 
models for MaxInt, called the Zipf model and the hierarchical scheme. As- 
sume that the terms of a query document are ordered by their frequency in the 
document collection. Now consider the probability curves for the two following
events with parameter q (Figure 1). Any q-match: there is a document in the
random (according to our models) collection that has at least q common terms
with the query document (the solid curve). Prefix q-match: there is a document
in the random collection that has at least the first q terms (according to the
order given by the term frequencies) of the query document (the dashed curve).
Both curves have a similar structure: the probability is close to 1 for small q,
but suddenly, at some "matching level", it falls to nearly zero. Our main
observation is that these matching levels for prefix q-match and any q-match are
very close to each other. This is extremely important for solving MaxInt.
Indeed, finding the best prefix q-match is computationally feasible. We show
that the closeness of the matching levels for any q-match and prefix q-match
allows us, with high probability, to find an approximate solution for MaxInt.

Figure 1: Exemplary probability curves for any q-match and prefix q-match
(vertical axis: Prob; the matching level is marked on the horizontal axis)

With respect to the conference version of this paper we generalize the matching
level theorem from regular queries to any query taken from the Zipf model.

Applications. The MaxInt problem is a natural formalization for many 
practical problems: 

Long search queries. Consider a bipartite graph representing websites and 
their words, that is, every website is represented as a set of words. Let a 
query be a moderately large set of words (say, 100 words). For example, 
one might get a large query by expanding a five word query by adding all 
synonyms. The search for a website containing ALL query terms might 
produce no result. Hence, a search for a website that has maximal
intersection with the query set is a natural alternative. Therefore, efficient
algorithms for MaxInt can help web search engines like google.com^1,
yahoo.com or msn.com to relax their restrictions on the query length.

^1 The reference time for all links mentioned in this paper is April 2007.

Content-based similarity. Consider a bipartite graph representing documents
and their terms. Finding a document in a database that has a maximal
number of common terms with a newcomer document might be a basic
routine for text clustering/classification and duplicates detection. Particular
examples are news classification systems (reuters.com), news clustering
(news.google.com), and spam detection.

New connection suggestions. Consider an undirected graph between people
representing, for example, friendship or co-authorship. Here, every person
is described by their name and a list of all their friends. Then, applying a
MaxInt query to a (new) person, we get a natural suggestion for establishing
a new connection for that person. Indeed, we get a person that has a maximal
number of joint friends with the query person. Related systems can be
found at linkedin.com (for acquaintances) and dblp.uni-trier.de (for
co-authorship).

Co-occurrence similarity. Consider an audience graph between people and 
some items. Every item is represented by a set of people who are interested 
in it. Take a (new) item together with its audience. Then, the MaxInt 
query returns an item that has the maximal co-occurrence with the query 
item in people's preference lists. Particular examples are the music band
similarity by their listeners (last.fm) and RSS-feeds similarity by their
subscribers (bloglines.com and feedburner.com).

Advertisement Matching. Delivering advertisements relevant to users'
interests is one of the most important problems in web technologies [15].
MaxInt can reflect this challenge in a natural way: Consider a graph
representing websites participating in some ad distribution system and their terms.
A query is a set of terms that describe some advertisement and its target
audience. Here, a solution of MaxInt suggests a website that is among
the best candidates to display the given ad. See google.com/adsense as
an example system for ad distribution.

Social recommendations. Consider a bipartite graph between people and 
their recommendations (e.g. for books, bars, cars). A query is a set of 
friends of some newcomer person. Finding an item that is already chosen 
by many of the newcomer's friends is a natural form of recommendation. 
The friendship graph together with recommended items is, for example,
accumulated on facebook.com.

Note that for some of these applications the Jaccard similarity coefficient 
may also be an appropriate similarity measure. 



Related Work. MaxInt is a special case of the nearest neighbor problem.
Indeed, one just needs to define the similarity between two documents as the
number of common words. There is also a way to define a metric (i.e. a distance
function satisfying the triangle inequality) providing the reverse similarity order.
To do this, we need to add some unique "imaginary" words to every document,
making their sizes equal, and then use the Jaccard metric [2]. Denoting the
maximal cardinality of a document in the collection by $M$, the resulting formula
is $d(A, B) = 1 - \frac{|A \cap B|}{2M - |A \cap B|}$. Instead of the Jaccard metric one could use the size of
the symmetric difference as well (again, one has to add unique words to every
document, making their sizes equal). This again defines a metric which provides
the reverse similarity order.
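The padding argument can be checked on a toy example. The sketch below assumes that both documents have at most $M$ words and that the padded documents are distinct; the function name `padded_jaccard_distance` is ours and purely illustrative. Since the padding words are unique to each document, the padded intersection has the original size and the padded union has size $2M - |A \cap B|$.

```python
from typing import List, Set


def padded_jaccard_distance(a: Set[str], b: Set[str], M: int) -> float:
    """Jaccard distance after padding a and b with unique dummy words up to size M."""
    if len(a) > M or len(b) > M:
        raise ValueError("M must be at least the size of both documents")
    common = len(a & b)  # dummy words never match, so the intersection is unchanged
    return 1.0 - common / (2 * M - common)


def rank_by_distance(docs: List[Set[str]], query: Set[str], M: int) -> List[int]:
    """Indices of docs ordered from nearest to farthest in the padded metric."""
    return sorted(range(len(docs)),
                  key=lambda i: padded_jaccard_distance(docs[i], query, M))


if __name__ == "__main__":
    docs = [{"x"}, {"a", "b"}, {"a", "b", "c"}]
    query = {"a", "b", "c", "d"}
    print(rank_by_distance(docs, query, M=4))
```

A larger intersection always yields a smaller distance, so the nearest neighbor in this metric is a MaxInt solution.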

Many efficient algorithms have been developed for nearest neighbor search 
in special cases or under various assumptions; see recent survey papers [TJ HI 
[SI [HI H] and the book [TO] for comprehensive reviews. Nearest neighbors are 
particularly well studied in vector models with the Euclidean distance function 
[13], [7]. Actually, we can interpret a document as a vector of 0s and 1s (1 means
a term is contained in a document). Then, the scalar product is equal to the size
of the intersection. Unfortunately, the random projection methods studied there
are not directly applicable to MaxInt. Namely, (1) we do not allow the complexity
to be linear in the vector length, and (2) a c-approximate solution for the Euclidean
distance is not necessarily a c-approximate solution for the size-of-intersection
similarity. Note that the length of the vectors (resulting from the overall number
of different terms in the document collection) can be much larger than the size
of the document collection.

Closely related to MaxInt is text search. Finding documents that fit best 
to some given search terms can also be considered as a problem on a bipartite 
graph. The documents and terms are the nodes and edges are drawn when a 
term occurs in a document. Basically the task is to find all documents containing 
every query term and rank these documents by relevance. The key technique 
in this area is inverted files (inverted indexing). A comprehensive survey of the
topic can be found in [20].

2 MaxInt in the Zipf Model 

Let $T = \{t_1, \dots, t_m\}$ be a set of terms and $\mathcal{D} = \mathcal{P}(T)$ be the power set of $T$,
whose elements are called documents. A document collection $\mathcal{D}_n$ is a subset
$\{d_1, \dots, d_n\} \subseteq \mathcal{D}$. We demand $m \in n^{O(1)}$. In the following we will use the
terms prefix match and any match instead of prefix q-match and any q-match since
the size of a matching is always stated explicitly. By log we always mean $\log_2$,
while ln denotes $\log_e$.

We now describe a probabilistic mechanism for generating a document
collection called the Zipf model. Every document is generated independently. Term
occurrences are also independent. A document contains term $t_i$ with probability
$1/i$. Hence, the expected number of terms in a document is approximately equal
to $\ln m$ in our model. This model is similar to the configuration model ([18])
with Zipf's law for the distribution of term degrees and constant document degrees.
Zipf's law states that in natural language texts the frequency $f$ of a word is
approximately inversely proportional to its rank $r$ in the frequency table, i.e.
there exists a constant $c$ such that $f \cdot r \approx c$ (Table 1). For more details about
Zipf's law see [17].
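The generating mechanism is easy to simulate. The following is a minimal sketch, assuming that term $t_i$ is represented by the integer $i$; the helper names and the seed handling are our own choices and not part of the model.

```python
import random
from typing import List, Set


def sample_zipf_document(m: int, rng: random.Random) -> Set[int]:
    """Sample one document: term t_i (1-based index i) is included with probability 1/i."""
    # rng.random() lies in [0, 1), so t_1 (probability 1) is always included.
    return {i for i in range(1, m + 1) if rng.random() < 1.0 / i}


def sample_collection(n: int, m: int, seed: int = 0) -> List[Set[int]]:
    """Sample n independent documents over m terms."""
    rng = random.Random(seed)
    return [sample_zipf_document(m, rng) for _ in range(n)]


if __name__ == "__main__":
    m = 200
    docs = sample_collection(2000, m, seed=1)
    mean_size = sum(len(d) for d in docs) / len(docs)
    harmonic = sum(1.0 / i for i in range(1, m + 1))  # exact expected document size
    print(mean_size, harmonic)
```

The exact expected size is the harmonic number $H_m$, which differs from $\ln m$ by less than one; this is the sense in which the expected size is approximately $\ln m$.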
Table 1: Empirical evaluation of Zipf's law on Tom Sawyer



Remark 1. The frequency of a term $t$ in a collection $\mathcal{D}_n$ of documents is defined
as

$$\frac{|\{d \in \mathcal{D}_n \mid t \in d\}|}{n}.$$

The expected frequency of the term $t_i$ is equal to $1/i$. At the same time,
the expected frequency rank for $t_i$ is exactly the $i$-th value among those of all
terms. So the Zipf model reflects Zipf's law in a natural way. Since some of
our motivating applications also deal with natural language texts, we can state
that the Zipf model agrees with real life at least by degree distribution.

Remark 2. Defining the probability of the term $t_i$ to be contained in a
document as $1/i$ turns the set $\mathcal{D}$ into a probability space where a document $d$
is an event that occurs with probability

$$P(d) = \Big(\prod_{t_j \in d} \frac{1}{j}\Big) \Big(\prod_{t_j \notin d} \Big(1 - \frac{1}{j}\Big)\Big).$$

In the following proofs we will use two inequalities ($a, b > 0$):

$$\Big(1 - \frac{a}{b}\Big)^{b} < e^{-a}, \quad a < b \qquad (2.1)$$

$$\Big(1 - \frac{1}{ab}\Big)^{a} > 1 - \frac{1}{b}, \quad a, b > 1 \qquad (2.2)$$

Indeed, let $g(x) = \ln(1 - x)/x$, $0 < x < 1$. Notice that $g$ is a decreasing function.
The first inequality follows from $g(a/b) < \lim_{x \to 0} g(x) = -1$, while the second
one is equivalent to $g(1/b) < g(1/ab)$.
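Both inequalities are easy to spot-check numerically. The sketch below is not a proof; the function names and the sample values are our own, and the values are kept moderate to stay clear of floating-point underflow.

```python
import math


def holds_2_1(a: float, b: float) -> bool:
    """Check (1 - a/b)^b < e^(-a) for 0 < a < b."""
    return (1.0 - a / b) ** b < math.exp(-a)


def holds_2_2(a: float, b: float) -> bool:
    """Check (1 - 1/(ab))^a > 1 - 1/b for a, b > 1."""
    return (1.0 - 1.0 / (a * b)) ** a > 1.0 - 1.0 / b


if __name__ == "__main__":
    for a in (0.5, 1.0, 2.0, 5.0):
        for b in (a + 0.5, 2 * a + 1, 10 * a):
            print("(2.1)", a, b, holds_2_1(a, b))
    for a in (1.5, 2.0, 10.0):
        for b in (1.5, 3.0, 50.0):
            print("(2.2)", a, b, holds_2_2(a, b))
```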

For further considerations we introduce the following terms and definitions:

Definition 2.1. Let

$$\underbrace{t_1\ t_2}_{P_1}\ \underbrace{t_3\ t_4\ t_5\ t_6\ t_7}_{P_2}\ \cdots$$

be a partition of the set of terms. The group $P_i$ includes the terms from
$t_{\lceil e^{i-1} \rceil}$ to $t_{\lfloor e^{i} \rfloor}$. We say that a document $d$ is regular if it contains
exactly $\ln m$ terms $p_1, \dots, p_{\ln m}$ such that $p_i \in P_i$.

Remark 3. Note that the expected number of terms in each group $P_i$ is
approximately one. If $\ln m$ is not an integer then the index of the last group is
$\lceil \ln m \rceil$. In this case the expected number of terms in this group is smaller
than one. To make the following proofs more legible we do not demand that the
number of terms of a document or the matching size is an integer. In real settings
these values have to be rounded appropriately.
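The group boundaries and the claim of Remark 3 can be tabulated directly. This is a small sketch with hypothetical helper names; terms are again identified with their indices.

```python
import math
from typing import Tuple


def group_bounds(i: int) -> Tuple[int, int]:
    """Indices (first, last) of the terms in group P_i: ceil(e^(i-1)) .. floor(e^i)."""
    return math.ceil(math.exp(i - 1)), math.floor(math.exp(i))


def expected_terms_in_group(i: int) -> float:
    """Expected number of terms of a Zipf-model document that fall into P_i."""
    first, last = group_bounds(i)
    return sum(1.0 / j for j in range(first, last + 1))


if __name__ == "__main__":
    for i in range(1, 7):
        print(i, group_bounds(i), round(expected_terms_in_group(i), 3))
```

The first group is the exception (it contains $t_1$ and $t_2$, so its expectation is 1.5); from the second group on the expectation stays close to one.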

Definition 2.2. Let $0 < \delta < 1$. We say that a document $d \in \mathcal{D}$ is $\delta$-$n$-generic
if the following holds:

$$\forall i \ge \delta\sqrt{2 \ln n}: \quad |\{t_j \in d \mid j \le e^{i}\}| \ge (1 - \delta) i.$$

Lemma 2.3. Let $0 < \delta < 1$ and $c = e^{-\delta^2/2}$. Let $d \in \mathcal{D}$ be a random document
following the Zipf model. It holds that for a sufficiently large $n$ the probability
that $d$ is $\delta$-$n$-generic is greater than $1 - c^{1.4\delta\sqrt{\ln n}}/(1 - c)$.

Proof. For a fixed $i$, let $X$ be a random variable denoting the number of terms in
$d$ up to the term $t_{e^i}$. The Chernoff bound $P(X < (1 - \delta)EX) < e^{-EX\delta^2/2}$ yields
that the probability that $d$ contains fewer than $(1 - \delta)i$ terms up to the term
$t_{e^i}$ is smaller than $e^{-i\delta^2/2}$. This holds since $i < EX$ and therefore

$$P(X < (1 - \delta)i) \le P(X < (1 - \delta)EX) < e^{-EX\delta^2/2} < e^{-i\delta^2/2} = c^{i}.$$

So the probability that $d$ is not $\delta$-$n$-generic is for large $n$ bounded by

$$\sum_{i \ge \delta\sqrt{2 \ln n}} c^{i} \le c^{\delta\sqrt{2 \ln n}} \sum_{j \ge 0} c^{j} = \frac{c^{\delta\sqrt{2 \ln n}}}{1 - c} < \frac{c^{1.4\delta\sqrt{\ln n}}}{1 - c}.$$

This holds because $1.4 < \sqrt{2}$ and $n$ is large. Overall, the probability that $d$
contains $(1 - \delta)i$ or more terms is greater than $1 - c^{1.4\delta\sqrt{\ln n}}/(1 - c)$. □

Lemma 2.4. Let $d \in \mathcal{D}$ be a random document following the Zipf model and
let $0 < \delta < 1$ and $c = e^{-\delta^2/2}$. If we insert the first $\delta\sqrt{2 \ln n}$ missing terms to
$d$ (assuming that there are missing terms), then for a sufficiently large $n$ the
following holds:

$$P\Big(\forall i \le \sqrt{2 \ln n}: \ |\{t_j \in d \mid j \le e^{i}\}| \ge i\Big) > 1 - \frac{c^{1.4\delta\sqrt{\ln n}}}{1 - c}.$$

Proof. Since we insert the first $\delta\sqrt{2 \ln n}$ missing terms to $d$, the number of
terms in $d$ is always at least $\delta\sqrt{2 \ln n}$. Thus, the statement holds trivially for
$i < \delta\sqrt{2 \ln n}$. So let us consider the case $i \ge \delta\sqrt{2 \ln n}$. By Lemma 2.3 we know
that the probability that $d$ is $\delta$-$n$-generic is greater than $1 - c^{1.4\delta\sqrt{\ln n}}/(1 - c)$
for large $n$. Now, by inserting the first $\delta\sqrt{2 \ln n}$ missing terms to $d$, we see that
the probability that there are at least $i$ terms $t_j$ with $j \le e^{i}$ is greater than
$1 - c^{1.4\delta\sqrt{\ln n}}/(1 - c)$. □
We now introduce a threshold, called matching level, to give statements
about the most probable size of a maximal intersection:

$$q = q_n := \sqrt{2 \ln n}.$$
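To get a feeling for how slowly the matching level grows with the collection size, one can simply evaluate it. The function name below is our own; the identity $e^{q^2/2} = n$, used repeatedly in the proofs, follows directly from the definition.

```python
import math


def matching_level(n: int) -> float:
    """Matching level q_n = sqrt(2 ln n) of the Zipf model."""
    return math.sqrt(2.0 * math.log(n))


if __name__ == "__main__":
    for n in (10**3, 10**6, 10**9):
        q = matching_level(n)
        # e^(q^2 / 2) recovers n up to rounding
        print(n, round(q, 2), round(math.exp(q * q / 2.0)))
```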

Theorem 2.5 (Matching Level for the Zipf Model). Let $\mathcal{D}_n = \{d_1, \dots, d_n\}$ be
a document collection following the Zipf model.

1. (Prefix match). Let $0 < \delta < 1$ be fixed. Let $\gamma = 2 + \delta\sqrt{2 \ln n}$ and
$c = e^{-\delta^2/2}$. For sufficiently large $n, m$ the following holds: The probability
that there exists a document in $\mathcal{D}_n$ that contains the first $q - \gamma$ terms
of a query document $d_{new} \in \mathcal{D}$ following the Zipf model is greater than
$1 - c^{\delta\sqrt{\ln n}}/(1 - c)$. Thus, the probability tends to one as $n \to \infty$.

2. (Any match). Let $\varepsilon > 0$ be fixed. The probability that there exists a
document in $\mathcal{D}_n$ that contains more than $(1 + \varepsilon)q$ terms of a query document
$d_{new} \in \mathcal{D}$ following the Zipf model tends to zero as $n \to \infty$.



Proof. 1. Let $d_R$ be a fixed regular document (Definition 2.1). The probability
that a document from $\mathcal{D}_n$ contains the prefix of length $q - 2$ of $d_R$
is at least

$$\frac{1}{e} \cdot \frac{1}{e^{2}} \cdots \frac{1}{e^{q-2}} \ge \frac{1}{e^{(q-1)^2/2}} = \frac{1}{e^{(q^2 - 2q + 1)/2}} = \frac{e^{q - 1/2}}{n}.$$

Note that $e^{q - 1/2}/n < 1$ since $e^{(q-1)^2/2} > 1$. This means that the probability
that there exists no document in $\mathcal{D}_n$ that contains the $(q - 2)$-prefix of $d_R$
is no more than

$$\Big(1 - \frac{e^{q - 1/2}}{n}\Big)^{n} < e^{-e^{q - 1/2}},$$

which follows from inequality (2.1). So with probability greater than
$1 - e^{-e^{q - 1/2}}$ there exists a document in $\mathcal{D}_n$ that has all terms from the
$(q - 2)$-prefix of $d_R$. Consider $d_{new}$. If we insert the first $\delta\sqrt{2 \ln n}$ missing
terms to $d_{new}$, Lemma 2.4 implies that for large $n$ the probability that $d_{new}$
contains in every group $P_i$, $i \le \sqrt{2 \ln n}$, at least as many terms as $d_R$ is
greater than $1 - c^{1.4\delta\sqrt{\ln n}}/(1 - c)$. Therefore, the probability that there
exists a document in $\mathcal{D}_n$ that matches the prefix of length $q - 2$ of the
extended query document $d_{new}$ is greater than

$$\Big(1 - e^{-e^{q - 1/2}}\Big) \Big(1 - \frac{c^{1.4\delta\sqrt{\ln n}}}{1 - c}\Big).$$

For large $n$, this product is at least $1 - c^{\delta\sqrt{\ln n}}/(1 - c)$. It remains to notice
that by removing the initially inserted $\delta\sqrt{2 \ln n}$ terms from $d_{new}$ we still
match $q - \gamma$ terms. This concludes the proof.



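2. Let us fix a query document $d_{new}$ for now. Let $d$ be a random document
following the Zipf model and let $\lambda > 0$. We can evaluate the Laplace
transform of the intersection size as follows:

$$E \exp(\lambda |d_{new} \cap d|) = \prod_{t_j \in d_{new}} \Big(1 - \frac{1}{j} + \frac{e^{\lambda}}{j}\Big) \le \prod_{t_j \in d_{new},\, j \le e^{\lambda}} \frac{e^{\lambda}}{j}\Big(1 + \frac{j}{e^{\lambda}}\Big) \prod_{t_j \in d_{new},\, j > e^{\lambda}} \Big(1 + \frac{e^{\lambda}}{j}\Big).$$

It follows that

$$\ln E \exp(\lambda |d_{new} \cap d|) \le \sum_{t_j \in d_{new},\, j > e^{\lambda}} \frac{e^{\lambda}}{j} + \sum_{t_j \in d_{new},\, j \le e^{\lambda}} \Big(\lambda - \ln j + \frac{j}{e^{\lambda}}\Big) = e^{\lambda} T_1 + \lambda T_2 - T_3 + e^{-\lambda} T_4,$$

where

$$T_1 = \sum_{t_j \in d_{new},\, j > e^{\lambda}} \frac{1}{j}, \quad T_2 = |d_{new} \cap [1, e^{\lambda}]|, \quad T_3 = \sum_{t_j \in d_{new},\, j \le e^{\lambda}} \ln j, \quad T_4 = \sum_{t_j \in d_{new},\, j \le e^{\lambda}} j.$$

We will assume that our fixed query $d_{new}$ satisfies the following four
regularity conditions:

$$e^{\lambda} T_1 \le \varepsilon \lambda^2, \quad T_2 \le (1 + \varepsilon)\lambda, \quad T_3 \ge (1 - \varepsilon)\frac{\lambda^2}{2}, \quad e^{-\lambda} T_4 \le \varepsilon \lambda^2.$$

Under these regularity conditions we obtain

$$\ln E \exp(\lambda |d_{new} \cap d|) \le (1 + 7\varepsilon)\frac{\lambda^2}{2}.$$

Assuming $(d_i)$ to be a sample of $n$ independent documents distributed
according to the Zipf model, by the Chebyshev exponential inequality, for
any $r > 0$ we have

$$P\Big(\max_i |d_{new} \cap d_i| > r\Big) \le n\, P(|d_{new} \cap d| > r) \le n\, \frac{E \exp(\lambda |d_{new} \cap d|)}{e^{\lambda r}} \le n \exp\Big((1 + 7\varepsilon)\frac{\lambda^2}{2} - \lambda r\Big).$$

Take any $\gamma > 0$. By choosing now

$$r = (\sqrt{2 \ln n} + \gamma)(1 + 7\varepsilon)$$

and

$$\lambda = \sqrt{2 \ln n} + \gamma$$

we obtain $(1 + 7\varepsilon)\lambda^2 = \lambda r$, hence

$$P\Big(\max_i |d_{new} \cap d_i| > r\Big) \le n \exp\Big(-(1 + 7\varepsilon)\frac{\lambda^2}{2}\Big) \le n \exp\Big(-\frac{(\sqrt{2 \ln n} + \gamma)^2}{2}\Big) \to 0$$

for $n \to \infty$, as required. Let now the query $d_{new}$ be randomly chosen
according to the Zipf model. For a sufficiently large $\lambda$ it is true that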


= E4 



In ,7 A2 in^j 



ETi = ^l<e 



A 



Let us explain the bound for $ET_3$. Let $2 < b < S$ be integers.

Eln j \nx , In j In^ 5 In^ b In 7 

T-L — '^^+EY = ^ 2- + E-^- 
For $b > 21$ it holds that




Inj 



J 



and therefore 




In j In^ S 



Hence, $\lambda > 4$ yields the desired bound on $ET_3$. The bound on $\mathrm{Var}\, T_3$ is
shown by similar techniques.

It remains to notice that the probability of each of the four regularity
conditions to be true for $d_{new}$ tends to one as $\lambda \to \infty$. Indeed, the
expectations and variances calculated above easily show this fact. □

By Theorem 2.5 we can conclude that with high probability there exists
a document in $\mathcal{D}_n$ that matches the prefix of length $q - \gamma$ of $d_{new}$, whereas the
probability to find a document that has more than $(1 + \varepsilon)q$ common terms with
$d_{new}$ (at arbitrary positions) is quite small. Therefore, it suffices to determine
a document that has a maximal common prefix with the query document. This,
in turn, allows us to sort the documents according to their sorted term lists^2
and then perform a binary search based on the sorted term list of the query
document (Figure 2).

Preprocessing:

1. For every document: Sort the term list according to the position
of the terms in the frequency table.

2. Sort the documents according to their sorted term lists.

Query: Find a document having the maximal common prefix with the
query document by binary search.

Figure 2: MaxInt algorithm in the Zipf model

The running time is as follows (for the average case^3 analysis we assume that
the length of term lists is $\log m \in O(\log n)$, for the worst case analysis the
length is $m$):

          average case                          worst case
Step 1    $O(n \cdot \log n \cdot \log\log n)$  $O(n \cdot m \cdot \log n)$
Step 2    $O(n \cdot \log^2 n)$                 $O(n \cdot \log n \cdot m)$
Query     $O(\log^2 n)$                         $O(\log n \cdot m)$

^2 In a sorted term list the terms of a document are ordered according to the
position of the terms in the frequency table.

^3 Only for the average case our constraints from Section 1 are preserved.

Figure 3: Hierarchical scheme

The log factor in the query step is due to the fact that the algorithm performs
a binary search on the document collection. Since in the average case the length
of a term list is $O(\log n)$, we get another log factor resulting in a query time
of $O(\log^2 n)$. One can try to improve the accuracy of our algorithm by finding
a "maximal prefix with at most one difference to the query document". A
recent technique called "indexing with errors" [BJ [TB] might be useful for such 
an extension. 
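The preprocessing and query steps of the algorithm for the Zipf model can be sketched as follows. This is an illustration under our own assumptions, not a tuned implementation: terms are represented by their integer rank in the frequency table (rank 1 is the most frequent term), documents are Python sets, and all function names are hypothetical. The query relies on the fact that in a lexicographically sorted list a document with the longest common prefix is adjacent to the insertion position of the query.

```python
from bisect import bisect_left
from typing import List, Sequence, Set, Tuple

TermList = Tuple[int, ...]


def sorted_term_list(doc: Set[int]) -> TermList:
    """Step 1: order the terms of a document by their rank in the frequency table."""
    return tuple(sorted(doc))


def preprocess(collection: Sequence[Set[int]]) -> List[TermList]:
    """Step 2: sort the documents lexicographically by their sorted term lists."""
    return sorted(sorted_term_list(d) for d in collection)


def common_prefix_length(a: TermList, b: TermList) -> int:
    """Length of the longest common prefix of two sorted term lists."""
    length = 0
    for x, y in zip(a, b):
        if x != y:
            break
        length += 1
    return length


def query_max_prefix(index: List[TermList], query: Set[int]) -> TermList:
    """Binary search for the query; the best prefix match is one of its two neighbors."""
    if not index:
        raise ValueError("empty document collection")
    q = sorted_term_list(query)
    pos = bisect_left(index, q)
    candidates = [index[i] for i in (pos - 1, pos) if 0 <= i < len(index)]
    return max(candidates, key=lambda doc: common_prefix_length(doc, q))


if __name__ == "__main__":
    collection = [{1, 2, 5}, {1, 3, 4}, {1, 2, 3, 7}, {2, 3}]
    index = preprocess(collection)
    print(query_max_prefix(index, {1, 2, 3, 9}))
```

The binary search performs a logarithmic number of term-list comparisons, each bounded by the term-list length, which is where the two log factors of the average-case query time come from.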

3 MaxInt in the Hierarchical Scheme

Our second model is motivated by the observation that in many existing
applications terms can be ordered hierarchically. Let $k \ge 8$ be an integer and let $T$
be a set of $(2^k - 1) \cdot k$ different terms. A document collection $\mathcal{D}_k$ consists of $2^k$
sets where every set $d \in \mathcal{D}_k$ is a subset of $T$ with $|d| = k$. A hierarchical scheme
is a table with $k$ levels, level 1 to level $k$. Level $i$, $1 \le i \le k$, is divided into $2^{i-1}$
cells, cell $C_{i,1}$ to cell $C_{i,2^{i-1}}$. For $2 \le l \le k$ we say that cell $C_{l-1,j}$, $1 \le j \le 2^{l-2}$,
is above cell $C_{l,j'}$, $1 \le j' \le 2^{l-1}$, if $\lceil j'/2 \rceil = j$. Every cell contains $k$ terms. A
document collection based on this scheme can be generated as follows: Every
document is generated independently. Choose a random cell on level $k$ and
mark it. Then, for $l = k, \dots, 2$, mark on level $l - 1$ the cell that is above the
already marked cell on level $l$. Now choose one random term in every marked
cell. The set of terms defined in this way forms a document of our collection. Note
that every document generated by this process corresponds to a unique sequence of
cells. We call such a sequence a cell path. There exists a natural ordering on
these cell paths where the cells $C_{1,1}, C_{2,1}, \dots, C_{k,1}$ describe the leftmost path and
the cells $C_{1,1}, C_{2,2}, \dots, C_{k,2^{k-1}}$ accordingly the rightmost one (see Figure 3).
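The generation process can be sketched in a few lines. The encoding of a term as a triple (level, cell index, position within the cell) is our own assumption for illustration, as are the function names.

```python
import random
from typing import List, Tuple

Term = Tuple[int, int, int]  # (level, cell index on that level, term index in the cell)


def cell_path(k: int, leaf: int) -> List[int]:
    """Cell indices on levels 1..k of the path ending in cell `leaf` on level k."""
    path = [0] * k
    j = leaf
    for level in range(k, 0, -1):
        path[level - 1] = j
        j = (j + 1) // 2  # ceil(j / 2): index of the cell above
    return path


def sample_document(k: int, rng: random.Random) -> List[Term]:
    """Mark a random cell on level k, walk up to level 1, pick one term per marked cell."""
    leaf = rng.randint(1, 2 ** (k - 1))
    return [(level, j, rng.randint(1, k))
            for level, j in enumerate(cell_path(k, leaf), start=1)]


def sample_collection(k: int, seed: int = 0) -> List[List[Term]]:
    """A collection of 2^k independently generated documents."""
    rng = random.Random(seed)
    return [sample_document(k, rng) for _ in range(2 ** k)]


if __name__ == "__main__":
    print(cell_path(3, 1), cell_path(3, 4))  # leftmost and rightmost path for k = 3
    print(sample_document(8, random.Random(0)))
```

Listing the terms of a document from level 1 to level $k$, as done here, already gives the order in which prefixes are compared later on.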

Remark 4. We claim that the hierarchical scheme follows Zipf's law. To be more
precise, the following holds: For every level the product of expected frequency
and expected frequency rank of a term is the same. Indeed, the expected
frequency of a term on level $i$ is given by the formula $2^k/(2^{i-1} \cdot k)$. The expected
rank of a term is given by the formula $(2^{i-1} - 1) \cdot k + 2^{i-2} \cdot k$. Hence, the product
between frequency and frequency rank (divided by $2^k$) is equal to

$$\frac{1}{2^{i-1} \cdot k} \Big( (2^{i-1} - 1) \cdot k + 2^{i-2} \cdot k \Big) = \frac{3}{2} - 2^{1-i} \in [0.5, 1.5),$$

which means it lies in a fixed interval and therefore follows the idea of Zipf's
law.

This time we introduce two matching levels to give statements about the
most probable size of a maximal intersection. The matching levels are

$$q = \frac{k}{1 + \log k} \quad \text{and} \quad q' = \frac{k}{\log k}.$$

Remark 5. Again, to keep proofs more legible, we do not demand that the
length of a prefix or the matching size is an integer (except part one of the
following theorem). But clearly, in real settings these values have to be rounded
appropriately.
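The two matching levels can be evaluated for concrete values of $k$ to see how close they are. The helper name is ours; the two identities printed alongside, $(2k)^q = 2^k$ and $k^{q'} = 2^k$, follow directly from the definitions with the base-2 logarithm.

```python
import math
from typing import Tuple


def matching_levels(k: int) -> Tuple[float, float]:
    """Matching levels (q, q') of the hierarchical scheme, with log to base 2."""
    log_k = math.log2(k)
    return k / (1.0 + log_k), k / log_k


if __name__ == "__main__":
    for k in (8, 64, 1024):
        q, q_prime = matching_levels(k)
        print(k, round(q, 2), round(q_prime, 2),
              math.isclose((2 * k) ** q, 2.0 ** k),
              math.isclose(float(k) ** q_prime, 2.0 ** k))
```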

Theorem 3.1 (Matching Levels for Hierarchical Scheme). Let $k \ge 8$ be an
integer and $2 \le \gamma < q$. Let $\mathcal{D}_k$ be a document collection following the hierarchical
scheme.

1. (Prefix match). The probability that there exists a document $d \in \mathcal{D}_k$ that
matches the first $\lfloor q - \gamma \rfloor$ terms (i.e. the terms from level 1 to level $\lfloor q - \gamma \rfloor$)
of a query document $d_{new}$ following the hierarchical scheme is greater than
$1 - 2^{-(2k)^{\gamma}}$.

2. (Any match). The probability that there exists a document $d \in \mathcal{D}_k$ that
matches at least $q' + \gamma$ terms of a query document $d_{new}$ following the
hierarchical scheme is smaller than $2/k^{\gamma - 1}$.



Proof. 1. The number of different prefixes of length $\lfloor q - \gamma \rfloor$ is at most

$$k (2k)^{\lfloor q - \gamma \rfloor - 1} < 2^{(1 + \log k)(q - \gamma)} = 2^{(1 + \log k)(k/(1 + \log k) - \gamma)} = 2^{k} \cdot (2k)^{-\gamma}.$$

So the probability that a new document does not match a prefix of length
$\lfloor q - \gamma \rfloor$ with any document from $\mathcal{D}_k$ is smaller than

$$\Big(1 - \frac{(2k)^{\gamma}}{2^{k}}\Big)^{2^{k}} < e^{-(2k)^{\gamma}}.$$

For $k \ge 8$ and $\gamma < q$ it holds that $(2k)^{\gamma} < 2^{k}$ and therefore the above
inequality follows from inequality (2.1). We get that the probability that
there exists a document in $\mathcal{D}_k$ with the same prefix as $d_{new}$ of length
$\lfloor q - \gamma \rfloor$ is greater than $1 - 2^{-(2k)^{\gamma}}$.



2. Let $t \ge q' + \gamma$ be the last position where the terms of $d$ and $d_{new}$ match.
We want to estimate the probability that $d_{new}$ matches at least $q' + \gamma$
terms at arbitrary positions with $d$. The probability that the first $t$ terms
(beginning at level 1) of $d$ and $d_{new}$ are all in the same cells is $2^{1-t}$. The
probability that at least $q' + \gamma$ terms are matched on some fixed positions
is no more than $(1/k)^{q' + \gamma} \cdot ((k-1)/k)^{t - q' - \gamma}$. An upper bound for the
number of different possibilities of matching at least $q' + \gamma$ out of $t$ terms
is $2^{t}$. Since the factor $((k-1)/k)^{t - q' - \gamma}$ is smaller than 1, overall we get
that the probability that $d_{new}$ matches at least $q' + \gamma$ terms at arbitrary
positions with $d$ is smaller than

$$k \cdot 2^{1-t} \cdot \Big(\frac{1}{k}\Big)^{q' + \gamma} \cdot 2^{t} = 2k \Big(\frac{1}{k}\Big)^{q' + \gamma} = \frac{2}{2^{k}\, k^{\gamma - 1}},$$

where the last equality uses $k^{q'} = 2^{k}$. The factor $k$ in the above equation is due
to the fact that we need to consider all possible levels for the last matched
position $t$. Now the probability that no document matches at $q' + \gamma$ arbitrary
positions with $d_{new}$ is at least

$$\Big(1 - 2k \Big(\frac{1}{k}\Big)^{q' + \gamma}\Big)^{2^{k}} = \Big(1 - \frac{2}{2^{k}\, k^{\gamma - 1}}\Big)^{2^{k}} > 1 - \frac{2}{k^{\gamma - 1}},$$

which follows from inequality (2.2), since $\gamma \ge 2$ and $k \ge 8$. So the
probability that there exists a document in $\mathcal{D}_k$ that matches at least $q' + \gamma$
terms of $d_{new}$ is smaller than $2/k^{\gamma - 1}$.

□

Theorem 3.1 yields that also for the hierarchical scheme it suffices to search
for a document that has a maximal common prefix with the query document. The
resulting algorithm is analogous to the one for the Zipf model and is summarized
in Figure 4.

The running time is shown in the table below. Note that for the hierarchical
scheme we only perform a worst case analysis since every document has equal
length.

          average case    worst case
Step 1    -               $O(2^k \cdot k \cdot \log k)$
Step 2    -               $O(2^k \cdot k^2)$
Query     -               $O(k^2)$

Preprocessing:

1. For every document: Sort the term list according to the
hierarchical scheme, i.e. according to the levels in which the terms
appear.

2. Sort the documents according to their corresponding cell paths,
i.e. documents that correspond to the leftmost path in the
scheme are at the beginning of the sorted list. Documents that
correspond to the same cell path are sorted lexicographically.

Query: Find a document having the maximal common prefix with the
query document by binary search.

Figure 4: MaxInt algorithm in the hierarchical scheme

4 Further Work 

In this paper we have shown that assumptions on the random nature of the 
input can lead to provable time and accuracy bounds for MaxInt. Also, we 
have discovered a MaxInt threshold phenomenon in two randomized models. 

The next step is to understand it better. Does it hold for other randomized
models from [18], especially for generalized random graphs with a power-law
degree distribution? Does it hold in real-life networks? Can we introduce
randomized models for sparse vector collections and find a similar effect there?
It was observed that tractable instances of nearest neighbors have small intrinsic 
dimension [SJ [5J [HJ |T2] . Does the same effect hold for the Zipf model and the 
hierarchical scheme? Of course, the most challenging problem is to find an 
exact algorithm for MaxInt (preserving our time constraints) or to prove its 
hardness. What are other particular cases or assumptions that have efficient 
MaxInt solutions? On the other hand, we have a very particular subcase for 
which we still do not believe in a positive solution. Hence, we ask for a hardness 
proof for the following on-line inclusion problem. 



On-line Inclusion Problem

Database: A family $\mathcal{F}$ of $2^k$ subsets of $[1 \dots k^2]$.

Query: Given a set $f_{new} \subseteq [1 \dots k^2]$, decide whether there exists an $f \in \mathcal{F}$
such that $f_{new} \subseteq f$.

Constraints: Space for preprocessed data $2^k \cdot poly(k)$.
Query time $poly(k)$.



Note that we have a constraint on space for preprocessing, not time. A related
problem but with a much stronger restriction on preprocessing space was proven
to be hard by Bruck and Naor [3].

Our algorithm in Section 3 uses polylogarithmic time (in the number of
documents) but it returns only an approximate solution with high probability
(not every time). Can we get an optimal solution or at least a guaranteed
approximation by relaxing the time constraint to expected polylogarithmic time?

The maximal intersection problem is a special case of a whole family of
problems called Strongest Connection Problems (SCP) which covers all problems
fitting the following framework. Consider some class of graphs $\mathcal{G}$ and some class
of paths $\mathcal{P}$.

Strongest Connection Problems

Database: A graph $G \in \mathcal{G}$.

Query: Given a (new) vertex $v$ (together with edges connecting $v$ with $G$),
return a vertex $u \in G$ with a maximal number of $\mathcal{P}$-paths from $v$ to
$u$.

Constraints: Preprocessing time $o(|G|^2)$.
Query time $o(|G|)$.



A number of well-motivated problems fall into the family of SCP. Some of 
them are listed below but the application of the SCP framework is not limited 
to these instances. 

Recommendations. $\mathcal{G}$: bipartite graphs; $\mathcal{P}$: paths of length three.

Example: A graph $G$ is partitioned into vertices representing people and
books, and the edge relation describes who has bought which books. Given
a person $v$, which book (not purchased by $v$) is most often bought
by people who had been interested in the books of $v$?

Similarity in folksonomies. $\mathcal{G}$: tripartite 3-graphs where the set of vertices
is partitioned into $(V_0, V_1, V_2)$ and $E$ is an edge relation in $V_0 \times V_1 \times V_2$;
$\mathcal{P}$: paths consisting of two edges that overlap either in $V_0$ or in $V_1$ or both
in $V_0$ and $V_1$.

Example: A graph representing all events of kind "a user U used a tag T to
label a website W". Then, the strongest connection query for the tag $T_{new}$
returns another tag that was most often co-used by the same users and/or
applied to the same websites. Such a tripartite 3-graph is accumulated in
del.icio.us.
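For the recommendation instance, counting paths of length three can be sketched by brute force. The purchase dictionary and the function name below are illustrative assumptions, and the scan over all people does not meet the sublinear query constraint of the SCP framework; it only makes the counted quantity concrete.

```python
from collections import Counter
from typing import Dict, Optional, Set


def recommend_book(purchases: Dict[str, Set[str]], person: str) -> Optional[str]:
    """Book not owned by `person` that is reached by the most paths of length three.

    A path has the form person - shared book - other person - new book.
    Ties are broken alphabetically; None is returned if no such path exists.
    """
    own = purchases.get(person, set())
    paths: Counter = Counter()
    for other, books in purchases.items():
        if other == person:
            continue
        shared = len(own & books)  # number of two-step paths person - book - other
        if shared == 0:
            continue
        for book in books - own:
            paths[book] += shared  # each shared book extends to one path of length three
    if not paths:
        return None
    return max(sorted(paths), key=lambda book: paths[book])


if __name__ == "__main__":
    purchases = {
        "ann": {"A", "B"},
        "bob": {"A", "B", "C"},
        "cat": {"B", "D"},
        "dan": {"E"},
    }
    print(recommend_book(purchases, "ann"))
```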